Thera Bank Personal Loan Campaign

Data Description:

The dataset contains data on 5000 customers. The data include customer demographic information (age, income, etc.), the customer's relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers, only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.

Context:

This case is about a bank (Thera Bank) whose management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with a minimal budget.

Attribute Information:

  • ID: Customer ID
  • Age: Customer's age in completed years
  • Experience: #years of professional experience
  • Income: Annual income of the customer (\$000)
  • ZIP Code: Home Address ZIP
  • Family: Family size of the customer
  • CCAvg: Avg. spending on credit cards per month (\$000)
  • Education: Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (\$000)
  • Personal Loan: Did this customer accept the personal loan offered in the last campaign?
  • Securities Account: Does the customer have a securities account with the bank?
  • CD Account: Does the customer have a certificate of deposit (CD) account with the bank?
  • Online: Does the customer use internet banking facilities?
  • Credit card: Does the customer use a credit card issued by the bank?

Learning Outcomes:

  • Exploratory Data Analysis
  • Preparing the data to train a model
  • Training and making predictions using a classification model
  • Model evaluation

Objective:

The classification goal is to predict the likelihood of a liability customer buying personal loans.

Steps and tasks:

  1. Import the datasets and libraries, check datatype, statistical summary, shape, null values or incorrect imputation. (5 marks)
  2. EDA: Study the data distribution in each attribute and target variable, share your findings (20 marks)
    • Number of unique in each column?
    • Number of people with zero mortgage?
    • Number of people with zero credit card spending per month?
    • Value counts of all categorical columns.
    • Univariate and Bivariate
    • Get data model ready
  3. Split the data into training and test set in the ratio of 70:30 respectively (5 marks)
  4. Use the Logistic Regression model to predict whether the customer will take a personal loan or not. Print all the metrics related to evaluating the model performance (accuracy, recall, precision, f1score, and roc_auc_score). Draw a heatmap to display confusion matrix (15 marks)
  5. Find out coefficients of all the attributes and show the output in a data frame with column names? For test data show all the rows where the predicted class is not equal to the observed class. (10 marks)
  6. Give conclusion related to the Business understanding of your model? (5 marks)
In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

import plotly.express as px

from sklearn.linear_model import LogisticRegression
logit = LogisticRegression(solver='newton-cg')

from sklearn.model_selection import train_test_split as tts

from sklearn.impute import SimpleImputer as si

from sklearn import metrics as mtr

import statsmodels.api as sm
In [2]:
import warnings
warnings.filterwarnings('ignore')

pd.options.display.float_format = '{:,.2f}'.format        

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

sns.set_style(style='darkgrid')
In [3]:
# Reading file
df = pd.read_csv('Bank_Personal_Loan_Modelling.csv')

# Renaming columns for ease of working with them
df.rename(columns={'ZIP Code':'Zipcode','Personal Loan':'AccptPersLoan','Securities Account':'HaveSecAcct','CD Account':'HaveCDAcct','Online':'HaveOnline','CreditCard':'HaveCC'},inplace=True)

#Checking data-types
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
ID               5000 non-null int64
Age              5000 non-null int64
Experience       5000 non-null int64
Income           5000 non-null int64
Zipcode          5000 non-null int64
Family           5000 non-null int64
CCAvg            5000 non-null float64
Education        5000 non-null int64
Mortgage         5000 non-null int64
AccptPersLoan    5000 non-null int64
HaveSecAcct      5000 non-null int64
HaveCDAcct       5000 non-null int64
HaveOnline       5000 non-null int64
HaveCC           5000 non-null int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB

Inference: Non-null 5000 row data-set with integer or float data-types


In [4]:
df.shape
Out[4]:
(5000, 14)
In [5]:
# Checking for null values
df.isnull().sum()
Out[5]:
ID               0
Age              0
Experience       0
Income           0
Zipcode          0
Family           0
CCAvg            0
Education        0
Mortgage         0
AccptPersLoan    0
HaveSecAcct      0
HaveCDAcct       0
HaveOnline       0
HaveCC           0
dtype: int64

Inference: No Null values in the data-set


In [6]:
#Checking for 'zero' values
df[df==0].count()
Out[6]:
ID                  0
Age                 0
Experience         66
Income              0
Zipcode             0
Family              0
CCAvg             106
Education           0
Mortgage         3462
AccptPersLoan    4520
HaveSecAcct      4478
HaveCDAcct       4698
HaveOnline       2016
HaveCC           3530
dtype: int64

Inference: There are a number of 'zero' values. Some of these are valid data - in Mortgage, Personal Loan, Securities Account, CD Account, Online, CreditCard.
Need to investigate if 'zero' in Experience and in 'CCAvg' is valid or if they need to be imputed.


In [7]:
#Statistical Summary
df.describe().T
Out[7]:
count mean std min 25% 50% 75% max
ID 5,000.00 2,500.50 1,443.52 1.00 1,250.75 2,500.50 3,750.25 5,000.00
Age 5,000.00 45.34 11.46 23.00 35.00 45.00 55.00 67.00
Experience 5,000.00 20.10 11.47 -3.00 10.00 20.00 30.00 43.00
Income 5,000.00 73.77 46.03 8.00 39.00 64.00 98.00 224.00
Zipcode 5,000.00 93,152.50 2,121.85 9,307.00 91,911.00 93,437.00 94,608.00 96,651.00
Family 5,000.00 2.40 1.15 1.00 1.00 2.00 3.00 4.00
CCAvg 5,000.00 1.94 1.75 0.00 0.70 1.50 2.50 10.00
Education 5,000.00 1.88 0.84 1.00 1.00 2.00 3.00 3.00
Mortgage 5,000.00 56.50 101.71 0.00 0.00 0.00 101.00 635.00
AccptPersLoan 5,000.00 0.10 0.29 0.00 0.00 0.00 0.00 1.00
HaveSecAcct 5,000.00 0.10 0.31 0.00 0.00 0.00 0.00 1.00
HaveCDAcct 5,000.00 0.06 0.24 0.00 0.00 0.00 0.00 1.00
HaveOnline 5,000.00 0.60 0.49 0.00 0.00 1.00 1.00 1.00
HaveCC 5,000.00 0.29 0.46 0.00 0.00 0.00 1.00 1.00

Inference:

  1. ID seems to be unique, no duplicates
  2. Experience has a 'negative' value - which implies data issues, which need to be resolved
  3. Mortgage column has outliers that need to be addressed
  4. Experience, Zipcode, Education, Personal Loan, Securities Account, CD Account, Online, CreditCard are all categorical variables
  5. Zipcode has data-entry issues with a 4 digit zipcode showing as min value
In [8]:
# Checking for skewness of the data
df.skew()
Out[8]:
ID                0.00
Age              -0.03
Experience       -0.03
Income            0.84
Zipcode         -12.50
Family            0.16
CCAvg             1.60
Education         0.23
Mortgage          2.10
AccptPersLoan     2.74
HaveSecAcct       2.59
HaveCDAcct        3.69
HaveOnline       -0.39
HaveCC            0.90
dtype: float64
In [9]:
# Visual analysis of data skewness
df[['Age','Experience','Income','Zipcode','Family','CCAvg','Education','Mortgage']].hist(bins=25,figsize=[25,15])
Out[9]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000027A537BE320>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000027A53741908>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000027A533062E8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000027A53335C18>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000027A536755F8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000027A53E77F98>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000027A53EAC978>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000027A53EEA390>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000027A53EEA3C8>]],
      dtype=object)

Inference:

  1. Zipcode with that 1 incorrect data is skewed to the left by a large extent
  2. CCAvg is skewed right
  3. Income is skewed right
  4. Rest of the continous values seem to be normally distributed

EDA

  1. Number of unique in each column?
  2. Number of people with zero mortgage?
  3. Number of people with zero credit card spending per month?
  4. Value counts of all categorical columns.
  5. Univariate and Bivariate
  6. Get data model ready
In [10]:
# Number of unique in each column?
df.nunique()
Out[10]:
ID               5000
Age                45
Experience         47
Income            162
Zipcode           467
Family              4
CCAvg             108
Education           3
Mortgage          347
AccptPersLoan       2
HaveSecAcct         2
HaveCDAcct          2
HaveOnline          2
HaveCC              2
dtype: int64
In [11]:
# Number of people with zero mortgage?

print('Number of people with zero mortgage : ',df.ID[df.Mortgage == 0].count())
Number of people with zero mortgage :  3462
In [12]:
# Number of people with zero credit card spending per month?

print('Number of people with zero credit card spending per month : ',df.ID[df.CCAvg == 0].count())
Number of people with zero credit card spending per month :  106
In [13]:
# Value counts of all categorical columns.

for col in ['AccptPersLoan','HaveSecAcct','HaveCDAcct','HaveOnline','HaveCC','Family','Education','Experience','Zipcode'] :
    print('Value Counts of {} \n{}\n\n'.format(col,df[col].value_counts()))
Value Counts of AccptPersLoan 
0    4520
1     480
Name: AccptPersLoan, dtype: int64


Value Counts of HaveSecAcct 
0    4478
1     522
Name: HaveSecAcct, dtype: int64


Value Counts of HaveCDAcct 
0    4698
1     302
Name: HaveCDAcct, dtype: int64


Value Counts of HaveOnline 
1    2984
0    2016
Name: HaveOnline, dtype: int64


Value Counts of HaveCC 
0    3530
1    1470
Name: HaveCC, dtype: int64


Value Counts of Family 
1    1472
2    1296
4    1222
3    1010
Name: Family, dtype: int64


Value Counts of Education 
1    2096
3    1501
2    1403
Name: Education, dtype: int64


Value Counts of Experience 
 32    154
 20    148
 9     147
 5     146
 23    144
 35    143
 25    142
 28    138
 18    137
 19    135
 26    134
 24    131
 3     129
 14    127
 16    127
 30    126
 34    125
 27    125
 17    125
 29    124
 22    124
 7     121
 8     119
 6     119
 15    119
 10    118
 33    117
 13    117
 11    116
 37    116
 36    114
 21    113
 4     113
 31    104
 12    102
 38     88
 39     85
 2      85
 1      74
 0      66
 40     57
 41     43
-1      33
-2      15
 42      8
-3       4
 43      3
Name: Experience, dtype: int64


Value Counts of Zipcode 
94720    169
94305    127
95616    116
90095     71
93106     57
92037     54
93943     54
91320     53
91711     52
94025     52
92093     51
90245     50
90024     50
90089     46
91330     46
92121     45
94304     45
94143     37
95051     34
94608     34
92182     32
92028     32
92521     32
95054     31
95814     30
95014     29
94542     27
94301     27
94550     27
93407     26
        ... 
95482      2
91129      2
95307      2
95842      2
91784      2
96094      2
92116      2
94507      2
94575      2
90745      2
93033      2
94019      2
92161      2
91941      2
94116      2
95816      2
91326      2
92705      2
94604      2
90813      1
94404      1
94965      1
9307       1
91024      1
92694      1
96145      1
94970      1
94598      1
90068      1
94087      1
Name: Zipcode, Length: 467, dtype: int64


Univariate and Bivariate
In [14]:
dfviz = df[['Age','Experience','Income','CCAvg','Mortgage','AccptPersLoan']].copy()
plt.figure(figsize=[30,20])
sns.set(font_scale=1.15)
# sns.pairplot(dfviz)
sns.pairplot(dfviz, hue='AccptPersLoan')
plt.show()
<Figure size 2160x1440 with 0 Axes>
In [15]:
plt.figure(figsize=[20,15])
sns.set(font_scale=1.15)
sns.heatmap(df.corr(),annot=True,vmin=-1,vmax=1,center=0, linewidth=0.2, fmt='.2f')
plt.show()

Inference:

  1. Obviously Age and Experience have a very strong positive linear relationship
  2. Income and CCAvg have a positive linear relationship (as expected)
  3. With regards to Accepting Personal Loan, from this visualition and with the data included it appears Income and CCAvg seems to be having a positive linear relationship
    • Meaning those who have higher income and having higher CCAvg seems to be accepting personal loans (habitual?!)


In [16]:
corr = df.drop('AccptPersLoan',axis=1).corr()
plt.figure(figsize=[20,15])
sns.set(font_scale=1.15)
sns.heatmap(corr,annot=True,vmin=-1,vmax=1,center=0, linewidth=0.2, fmt='.2f', cmap='coolwarm')
plt.show()

Inference:

  1. Comparing the features (dropping the Dependent variable), the positive linear relationship shows but not very strong
  2. Interestingly Education and CCAvg have a negative linear relationship! (maybe data issues or outliers)
  3. Similartly Eduation and Incomg have a negative linear relationship! (maybe data issues or outliers)
In [17]:
# Comparing the distribution of Age in relation to acceptance of Loan
fig = px.violin(df, color='AccptPersLoan', y='Experience', width=1000)
fig.show()

Inference: Given the shape of the violin is similar between the two categories, there isn't any direct inference if age influences acceptance of the loans


In [18]:
fig = px.violin(df, color='AccptPersLoan', y='CCAvg', width=1000)
fig.show()

Inference: Individuals with lower CC Average tend to NOT accept personal loans


Getting Data-set ready

In [19]:
# Dropping ID column since that is an identifier and not a feature
df.drop('ID', axis=1, inplace=True)
In [20]:
# Dropping observations with incorrect data (with outlier 4 digit Zipcode)
df.drop(df[df.Zipcode < 90000].index, inplace=True)
In [21]:
# Converting zipcode to bins
ziplabels = ['900-905','905-910','910-915','915-920','920-925','925-930','930-935','935-940','940-945','945-950','950-955','955-960','960+']
df['Zipcode'] = pd.cut(df.Zipcode, bins=[90000, 90500, 91000, 91500, 92000, 92500, 93000, 93500, 94000, 94500, 95000, 95500, 96000,100000], labels=ziplabels)
df.head()
Out[21]:
Age Experience Income Zipcode Family CCAvg Education Mortgage AccptPersLoan HaveSecAcct HaveCDAcct HaveOnline HaveCC
0 25 1 49 910-915 4 1.60 1 0 0 1 0 0 0
1 45 19 34 900-905 3 1.50 1 0 0 1 0 0 0
2 39 15 11 945-950 1 1.00 1 0 0 0 0 0 0
3 35 9 100 940-945 1 2.70 2 0 0 0 0 0 0
4 35 8 45 910-915 4 1.00 2 0 0 0 0 0 1
In [22]:
# Checking for acceptance of loans across the zipcodes

fig = px.histogram(df, x='Zipcode', color='AccptPersLoan', barmode='group',
                  category_orders={'Zipcode': ziplabels},
                   width=1000)
fig.show()
In [23]:
# Dropping observations with incorrect data (with Experience less than zero)
df.drop(df[df.Experience < 0 ].index, inplace=True)
In [24]:
# Checking distribution of CC Average
fig = px.histogram(df, x='CCAvg', marginal='rug', width=1000)
fig.show()
In [25]:
# Converting CC Average to bins - given it is already represented in '000s'
cclabels = ['0-1K','1-2K','2-3K','3-4K','4-5K','5-8K','8K+']
df['CCAvg'] = pd.cut(df.CCAvg, bins=[0, 1, 2, 3, 4, 5, 8,12], labels=cclabels)
df.head()
Out[25]:
Age Experience Income Zipcode Family CCAvg Education Mortgage AccptPersLoan HaveSecAcct HaveCDAcct HaveOnline HaveCC
0 25 1 49 910-915 4 1-2K 1 0 0 1 0 0 0
1 45 19 34 900-905 3 1-2K 1 0 0 1 0 0 0
2 39 15 11 945-950 1 0-1K 1 0 0 0 0 0 0
3 35 9 100 940-945 1 2-3K 2 0 0 0 0 0 0
4 35 8 45 910-915 4 0-1K 2 0 0 0 0 0 1
In [26]:
df.Family = df.Family.astype('category')
df.Education = df.Education.astype('category')
In [27]:
df = pd.get_dummies(df)
In [28]:
df.describe().T
Out[28]:
count mean std min 25% 50% 75% max
Age 4,947.00 45.56 11.32 24.00 36.00 46.00 55.00 67.00
Experience 4,947.00 20.33 11.31 0.00 10.50 20.00 30.00 43.00
Income 4,947.00 73.83 46.11 8.00 39.00 64.00 98.00 224.00
Mortgage 4,947.00 56.65 101.84 0.00 0.00 0.00 101.00 635.00
AccptPersLoan 4,947.00 0.10 0.30 0.00 0.00 0.00 0.00 1.00
HaveSecAcct 4,947.00 0.10 0.31 0.00 0.00 0.00 0.00 1.00
HaveCDAcct 4,947.00 0.06 0.24 0.00 0.00 0.00 0.00 1.00
HaveOnline 4,947.00 0.60 0.49 0.00 0.00 1.00 1.00 1.00
HaveCC 4,947.00 0.29 0.46 0.00 0.00 0.00 1.00 1.00
Zipcode_900-905 4,947.00 0.12 0.32 0.00 0.00 0.00 0.00 1.00
Zipcode_905-910 4,947.00 0.03 0.16 0.00 0.00 0.00 0.00 1.00
Zipcode_910-915 4,947.00 0.07 0.25 0.00 0.00 0.00 0.00 1.00
Zipcode_915-920 4,947.00 0.04 0.21 0.00 0.00 0.00 0.00 1.00
Zipcode_920-925 4,947.00 0.12 0.33 0.00 0.00 0.00 0.00 1.00
Zipcode_925-930 4,947.00 0.07 0.26 0.00 0.00 0.00 0.00 1.00
Zipcode_930-935 4,947.00 0.05 0.22 0.00 0.00 0.00 0.00 1.00
Zipcode_935-940 4,947.00 0.03 0.18 0.00 0.00 0.00 0.00 1.00
Zipcode_940-945 4,947.00 0.16 0.36 0.00 0.00 0.00 0.00 1.00
Zipcode_945-950 4,947.00 0.14 0.35 0.00 0.00 0.00 0.00 1.00
Zipcode_950-955 4,947.00 0.08 0.28 0.00 0.00 0.00 0.00 1.00
Zipcode_955-960 4,947.00 0.08 0.27 0.00 0.00 0.00 0.00 1.00
Zipcode_960+ 4,947.00 0.01 0.09 0.00 0.00 0.00 0.00 1.00
Family_1 4,947.00 0.30 0.46 0.00 0.00 0.00 1.00 1.00
Family_2 4,947.00 0.26 0.44 0.00 0.00 0.00 1.00 1.00
Family_3 4,947.00 0.20 0.40 0.00 0.00 0.00 0.00 1.00
Family_4 4,947.00 0.24 0.43 0.00 0.00 0.00 0.00 1.00
CCAvg_0-1K 4,947.00 0.36 0.48 0.00 0.00 0.00 1.00 1.00
CCAvg_1-2K 4,947.00 0.27 0.44 0.00 0.00 0.00 1.00 1.00
CCAvg_2-3K 4,947.00 0.18 0.38 0.00 0.00 0.00 0.00 1.00
CCAvg_3-4K 4,947.00 0.06 0.24 0.00 0.00 0.00 0.00 1.00
CCAvg_4-5K 4,947.00 0.04 0.20 0.00 0.00 0.00 0.00 1.00
CCAvg_5-8K 4,947.00 0.06 0.24 0.00 0.00 0.00 0.00 1.00
CCAvg_8K+ 4,947.00 0.01 0.09 0.00 0.00 0.00 0.00 1.00
Education_1 4,947.00 0.42 0.49 0.00 0.00 0.00 1.00 1.00
Education_2 4,947.00 0.28 0.45 0.00 0.00 0.00 1.00 1.00
Education_3 4,947.00 0.30 0.46 0.00 0.00 0.00 1.00 1.00
In [29]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4947 entries, 0 to 4999
Data columns (total 36 columns):
Age                4947 non-null int64
Experience         4947 non-null int64
Income             4947 non-null int64
Mortgage           4947 non-null int64
AccptPersLoan      4947 non-null int64
HaveSecAcct        4947 non-null int64
HaveCDAcct         4947 non-null int64
HaveOnline         4947 non-null int64
HaveCC             4947 non-null int64
Zipcode_900-905    4947 non-null uint8
Zipcode_905-910    4947 non-null uint8
Zipcode_910-915    4947 non-null uint8
Zipcode_915-920    4947 non-null uint8
Zipcode_920-925    4947 non-null uint8
Zipcode_925-930    4947 non-null uint8
Zipcode_930-935    4947 non-null uint8
Zipcode_935-940    4947 non-null uint8
Zipcode_940-945    4947 non-null uint8
Zipcode_945-950    4947 non-null uint8
Zipcode_950-955    4947 non-null uint8
Zipcode_955-960    4947 non-null uint8
Zipcode_960+       4947 non-null uint8
Family_1           4947 non-null uint8
Family_2           4947 non-null uint8
Family_3           4947 non-null uint8
Family_4           4947 non-null uint8
CCAvg_0-1K         4947 non-null uint8
CCAvg_1-2K         4947 non-null uint8
CCAvg_2-3K         4947 non-null uint8
CCAvg_3-4K         4947 non-null uint8
CCAvg_4-5K         4947 non-null uint8
CCAvg_5-8K         4947 non-null uint8
CCAvg_8K+          4947 non-null uint8
Education_1        4947 non-null uint8
Education_2        4947 non-null uint8
Education_3        4947 non-null uint8
dtypes: int64(9), uint8(27)
memory usage: 516.9 KB
Split the data into training and test set in the ratio of 70:30 respectively (5 marks)
In [30]:
dfy = df[['AccptPersLoan']].copy()
dfX = df.drop('AccptPersLoan',axis=1).copy()
In [31]:
X_train, X_test, y_train, y_test = tts(dfX, dfy, test_size=0.3, random_state=1)
In [32]:
X_test.describe().T
Out[32]:
count mean std min 25% 50% 75% max
Age 1,485.00 45.86 11.21 24.00 36.00 46.00 55.00 67.00
Experience 1,485.00 20.63 11.21 0.00 11.00 21.00 30.00 43.00
Income 1,485.00 72.99 45.22 8.00 39.00 63.00 94.00 224.00
Mortgage 1,485.00 56.93 101.27 0.00 0.00 0.00 103.00 617.00
HaveSecAcct 1,485.00 0.10 0.30 0.00 0.00 0.00 0.00 1.00
HaveCDAcct 1,485.00 0.07 0.25 0.00 0.00 0.00 0.00 1.00
HaveOnline 1,485.00 0.59 0.49 0.00 0.00 1.00 1.00 1.00
HaveCC 1,485.00 0.28 0.45 0.00 0.00 0.00 1.00 1.00
Zipcode_900-905 1,485.00 0.10 0.30 0.00 0.00 0.00 0.00 1.00
Zipcode_905-910 1,485.00 0.03 0.16 0.00 0.00 0.00 0.00 1.00
Zipcode_910-915 1,485.00 0.06 0.24 0.00 0.00 0.00 0.00 1.00
Zipcode_915-920 1,485.00 0.04 0.20 0.00 0.00 0.00 0.00 1.00
Zipcode_920-925 1,485.00 0.13 0.33 0.00 0.00 0.00 0.00 1.00
Zipcode_925-930 1,485.00 0.08 0.27 0.00 0.00 0.00 0.00 1.00
Zipcode_930-935 1,485.00 0.05 0.22 0.00 0.00 0.00 0.00 1.00
Zipcode_935-940 1,485.00 0.04 0.19 0.00 0.00 0.00 0.00 1.00
Zipcode_940-945 1,485.00 0.16 0.36 0.00 0.00 0.00 0.00 1.00
Zipcode_945-950 1,485.00 0.14 0.35 0.00 0.00 0.00 0.00 1.00
Zipcode_950-955 1,485.00 0.08 0.28 0.00 0.00 0.00 0.00 1.00
Zipcode_955-960 1,485.00 0.08 0.28 0.00 0.00 0.00 0.00 1.00
Zipcode_960+ 1,485.00 0.01 0.09 0.00 0.00 0.00 0.00 1.00
Family_1 1,485.00 0.31 0.46 0.00 0.00 0.00 1.00 1.00
Family_2 1,485.00 0.25 0.43 0.00 0.00 0.00 0.00 1.00
Family_3 1,485.00 0.20 0.40 0.00 0.00 0.00 0.00 1.00
Family_4 1,485.00 0.24 0.43 0.00 0.00 0.00 0.00 1.00
CCAvg_0-1K 1,485.00 0.37 0.48 0.00 0.00 0.00 1.00 1.00
CCAvg_1-2K 1,485.00 0.27 0.44 0.00 0.00 0.00 1.00 1.00
CCAvg_2-3K 1,485.00 0.18 0.38 0.00 0.00 0.00 0.00 1.00
CCAvg_3-4K 1,485.00 0.06 0.23 0.00 0.00 0.00 0.00 1.00
CCAvg_4-5K 1,485.00 0.04 0.20 0.00 0.00 0.00 0.00 1.00
CCAvg_5-8K 1,485.00 0.05 0.23 0.00 0.00 0.00 0.00 1.00
CCAvg_8K+ 1,485.00 0.01 0.09 0.00 0.00 0.00 0.00 1.00
Education_1 1,485.00 0.42 0.49 0.00 0.00 0.00 1.00 1.00
Education_2 1,485.00 0.27 0.45 0.00 0.00 0.00 1.00 1.00
Education_3 1,485.00 0.30 0.46 0.00 0.00 0.00 1.00 1.00

Modeling

Use the Logistic Regression model to predict whether the customer will take a personal loan or not.
Print all the metrics related to evaluating the model performance (accuracy, recall, precision, f1score, and roc_auc_score).
Draw a heatmap to display confusion matrix (15 marks)

Iterarion 0

In [33]:
logit.fit(X_train,y_train)
print('Training score : ',logit.score(X_train, y_train))

logit.fit(X_test,y_test)
print('Test score : ',logit.score(X_test, y_test))
Training score :  0.9647602541883304
Test score :  0.9656565656565657
In [34]:
y_predict = logit.predict(X_test)
cmat = mtr.confusion_matrix(y_test,y_predict, labels=[1,0])

dfcm = pd.DataFrame(cmat, index=['1','0'],columns=['1','0'])
plt.figure(figsize=[5,3])
sns.set(font_scale=1.2)
fig = sns.heatmap(dfcm, annot=True, fmt='d',linewidth=0.5, cbar=False)
plt.tick_params(axis='both', which='major', labelbottom = False, bottom=False, top = False, labeltop=True)
plt.ylabel('Predicted\n')
plt.show()
In [35]:
npcm = np.array(cmat)
print('Accuracy of the model : ',npcm.trace()/npcm.sum())
print('Recall of the model : ',npcm[0,0]/npcm[:,0].sum())
print('Precision of the model : ',npcm[0,0]/npcm[0].sum())
Accuracy of the model :  0.9656565656565657
Recall of the model :  0.8775510204081632
Precision of the model :  0.688
In [36]:
appr0 = np.array([[npcm.trace()/npcm.sum()],[npcm[0,0]/npcm[:,0].sum()],[npcm[0,0]/npcm[0].sum()]])
appr0
Out[36]:
array([[0.96565657],
       [0.87755102],
       [0.688     ]])

Iteration 1

In [37]:
df1 = df.copy()
In [38]:
fig = px.histogram(df1[df1.Mortgage > 0 ], x='Mortgage', marginal='rug', width=1000)
fig.show()
In [39]:
# Converting Mortgage to bins
mortgagelabels = ['Zero','0-100K','100-150K','150-200K','200-250K','250-300K','300-400K','400K+']
df1.Mortgage = pd.cut(df1.Mortgage, bins=[-1,0,100, 150, 200, 250, 300, 400, 1000], labels=mortgagelabels)

df1.Mortgage.value_counts(normalize=True)*100
Out[39]:
Zero       69.15
100-150K    9.48
150-200K    5.80
0-100K      5.62
200-250K    3.96
300-400K    2.55
250-300K    1.96
400K+       1.48
Name: Mortgage, dtype: float64
In [40]:
df1 = pd.get_dummies(df1)
In [41]:
df1.HaveSecAcct = df1.HaveSecAcct.astype('category')
df1.HaveCDAcct = df1.HaveCDAcct.astype('category')
df1.HaveOnline = df1.HaveOnline.astype('category')
df1.HaveCC = df1.HaveCC.astype('category')
In [42]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4947 entries, 0 to 4999
Data columns (total 43 columns):
Age                  4947 non-null int64
Experience           4947 non-null int64
Income               4947 non-null int64
AccptPersLoan        4947 non-null int64
HaveSecAcct          4947 non-null category
HaveCDAcct           4947 non-null category
HaveOnline           4947 non-null category
HaveCC               4947 non-null category
Zipcode_900-905      4947 non-null uint8
Zipcode_905-910      4947 non-null uint8
Zipcode_910-915      4947 non-null uint8
Zipcode_915-920      4947 non-null uint8
Zipcode_920-925      4947 non-null uint8
Zipcode_925-930      4947 non-null uint8
Zipcode_930-935      4947 non-null uint8
Zipcode_935-940      4947 non-null uint8
Zipcode_940-945      4947 non-null uint8
Zipcode_945-950      4947 non-null uint8
Zipcode_950-955      4947 non-null uint8
Zipcode_955-960      4947 non-null uint8
Zipcode_960+         4947 non-null uint8
Family_1             4947 non-null uint8
Family_2             4947 non-null uint8
Family_3             4947 non-null uint8
Family_4             4947 non-null uint8
CCAvg_0-1K           4947 non-null uint8
CCAvg_1-2K           4947 non-null uint8
CCAvg_2-3K           4947 non-null uint8
CCAvg_3-4K           4947 non-null uint8
CCAvg_4-5K           4947 non-null uint8
CCAvg_5-8K           4947 non-null uint8
CCAvg_8K+            4947 non-null uint8
Education_1          4947 non-null uint8
Education_2          4947 non-null uint8
Education_3          4947 non-null uint8
Mortgage_Zero        4947 non-null uint8
Mortgage_0-100K      4947 non-null uint8
Mortgage_100-150K    4947 non-null uint8
Mortgage_150-200K    4947 non-null uint8
Mortgage_200-250K    4947 non-null uint8
Mortgage_250-300K    4947 non-null uint8
Mortgage_300-400K    4947 non-null uint8
Mortgage_400K+       4947 non-null uint8
dtypes: category(4), int64(4), uint8(35)
memory usage: 382.0 KB
In [43]:
dfy1 = df1[['AccptPersLoan']].copy()
dfX1 = df1.drop('AccptPersLoan',axis=1).copy()
In [44]:
X1_train, X1_test, y1_train, y1_test = tts(dfX1, dfy1, test_size=0.3, random_state=30)
In [45]:
y1_train.AccptPersLoan.value_counts(normalize=True)*100
Out[45]:
0   89.83
1   10.17
Name: AccptPersLoan, dtype: float64
In [46]:
logit.fit(X1_train,y1_train)
print('Training score : ',logit.score(X1_train, y1_train))

logit.fit(X1_test,y1_test)
print('Test score : ',logit.score(X1_test, y1_test))
Training score :  0.963893703061814
Test score :  0.9730639730639731
In [47]:
y1_predict = logit.predict(X1_test)
cmat1 = mtr.confusion_matrix(y1_test,y1_predict, labels=[1,0])

dfcm1 = pd.DataFrame(cmat1, index=['1','0'],columns=['1','0'])
plt.figure(figsize=[5,3])
sns.set(font_scale=1.2)
fig = sns.heatmap(dfcm1, annot=True, fmt='d',linewidth=0.5, cbar=False)
plt.tick_params(axis='both', which='major', labelbottom = False, bottom=False, top = False, labeltop=True)
plt.ylabel('Predicted\n')
plt.show()
In [48]:
npcm1 = np.array(cmat1)
print('Accuracy of the model : ',npcm1.trace()/npcm1.sum())
print('Recall of the model : ',npcm1[0,0]/npcm1[:,0].sum())
print('Precision of the model : ',npcm1[0,0]/npcm1[0].sum())
Accuracy of the model :  0.9730639730639731
Recall of the model :  0.94
Precision of the model :  0.734375
In [49]:
appr1 = np.array([[npcm1.trace()/npcm1.sum()],[npcm1[0,0]/npcm1[:,0].sum()],[npcm1[0,0]/npcm1[0].sum()]])
appr1
Out[49]:
array([[0.97306397],
       [0.94      ],
       [0.734375  ]])

Iteration 2

In [50]:
df2 = df1.copy()
In [51]:
df2['Income_Log'] = np.log(df2.Income)

fig = px.box(df2, x='Income', width=1000, height=200)
fig.show()

fig = px.box(df2, x='Income_Log', width=1000, height=200)
fig.show()
In [52]:
df2.drop('Income',axis=1,inplace=True)
dfy2 = df2[['AccptPersLoan']].copy()
dfX2 = df2.drop('AccptPersLoan',axis=1).copy()
In [53]:
X2_train, X2_test, y2_train, y2_test = tts(dfX2, dfy2, test_size=0.3, random_state=30)
In [54]:
y2_train.AccptPersLoan.value_counts(normalize=True)*100
Out[54]:
0   89.83
1   10.17
Name: AccptPersLoan, dtype: float64
In [55]:
logit.fit(X2_train,y2_train)
print('Training score : ',logit.score(X2_train, y2_train))

logit.fit(X2_test,y2_test)
print('Test score : ',logit.score(X2_test, y2_test))
Training score :  0.9682264586943963
Test score :  0.9744107744107744
In [56]:
y2_predict = logit.predict(X2_test)
cmat2 = mtr.confusion_matrix(y2_test,y2_predict, labels=[1,0])

dfcm2 = pd.DataFrame(cmat2, index=['1','0'],columns=['1','0'])
plt.figure(figsize=[5,3])
sns.set(font_scale=1.2)
fig = sns.heatmap(dfcm2, annot=True, fmt='d',linewidth=0.5, cbar=False)
plt.tick_params(axis='both', which='major', labelbottom = False, bottom=False, top = False, labeltop=True)
plt.ylabel('Predicted\n')
plt.show()
In [57]:
npcm2 = np.array(cmat2)
print('Accuracy of the model : ',npcm2.trace()/npcm2.sum())
print('Recall of the model : ',npcm2[0,0]/npcm2[:,0].sum())
print('Precision of the model : ',npcm2[0,0]/npcm2[0].sum())
Accuracy of the model :  0.9744107744107744
Recall of the model :  0.96875
Precision of the model :  0.7265625
In [58]:
appr2 = np.array([[npcm2.trace()/npcm2.sum()],[npcm2[0,0]/npcm2[:,0].sum()],[npcm2[0,0]/npcm2[0].sum()]])
appr2
Out[58]:
array([[0.97441077],
       [0.96875   ],
       [0.7265625 ]])

Iteration 3 (using Logit models with class_weight)

In [59]:
df3 = df2.copy()
In [60]:
dfy3 = df3[['AccptPersLoan']].copy()
dfX3 = df3.drop('AccptPersLoan',axis=1).copy()
In [61]:
X3_train, X3_test, y3_train, y3_test = tts(dfX3, dfy3, test_size=0.3, random_state=30)
In [62]:
y3_train.AccptPersLoan.value_counts(normalize=True)*100
Out[62]:
0   89.83
1   10.17
Name: AccptPersLoan, dtype: float64
In [63]:
logit = LogisticRegression(solver='newton-cg', class_weight='balanced')
logit.fit(X3_train,y3_train)
print('Training score : ',logit.score(X3_train, y3_train))

logit.fit(X3_test,y3_test)
print('Test score : ',logit.score(X3_test, y3_test))
Training score :  0.9182553437319468
Test score :  0.94006734006734
In [64]:
y3_predict = logit.predict(X3_test)
cmat3 = mtr.confusion_matrix(y3_test,y3_predict, labels=[1,0])

dfcm3 = pd.DataFrame(cmat3, index=['1','0'],columns=['1','0'])
plt.figure(figsize=[5,3])
sns.set(font_scale=1.2)
fig = sns.heatmap(dfcm3, annot=True, fmt='d',linewidth=0.5, cbar=False)
plt.tick_params(axis='both', which='major', labelbottom = False, bottom=False, top = False, labeltop=True)
plt.ylabel('Predicted\n')
plt.show()
In [65]:
npcm3 = np.array(cmat3)
print('Accuracy of the model : ',npcm3.trace()/npcm3.sum())
print('Recall of the model : ',npcm3[0,0]/npcm3[:,0].sum())
print('Precision of the model : ',npcm3[0,0]/npcm3[0].sum())
Accuracy of the model :  0.94006734006734
Recall of the model :  0.5933014354066986
Precision of the model :  0.96875
In [66]:
appr3 = np.array([[npcm3.trace()/npcm3.sum()],[npcm3[0,0]/npcm3[:,0].sum()],[npcm3[0,0]/npcm3[0].sum()]])
appr3
Out[66]:
array([[0.94006734],
       [0.59330144],
       [0.96875   ]])

Summary:

Comparing the results of the iterations:

In [67]:
compare = pd.DataFrame('',index=['Accuracy','Recall','Precision'],columns=['Iteration0','Iteration1','Iteration2','Iteration3'])
In [68]:
compare.Iteration0 = appr0
compare.Iteration1 = appr1
compare.Iteration2 = appr2
compare.Iteration3 = appr3
pd.options.display.float_format = '{:,.4f}'.format  
compare
Out[68]:
Iteration0 Iteration1 Iteration2 Iteration3
Accuracy 0.9657 0.9731 0.9744 0.9401
Recall 0.8776 0.9400 0.9688 0.5933
Precision 0.6880 0.7344 0.7266 0.9688
In [69]:
cmpr = compare.T
cmpr = compare.reset_index()
cmpr = cmpr.melt(id_vars='index',value_vars=['Iteration0','Iteration1','Iteration2','Iteration3'], var_name='Iteration',value_name='Value')
# cmpr

fig = px.line(cmpr, x='Iteration', y='Value', color='index', range_y=[0.5,1.1], width=1000, height=600)
fig.data[0].update(mode='markers+lines')
fig.data[1].update(mode='markers+lines')
fig.data[2].update(mode='markers+lines')
fig.show()

Comparing...

  1. In this business context of the need to maximize the conversion (correctly identifying the individuals most likely to accept loans), it is preferable to have higher True Positives
  2. Meaning, it is preferable to reach out to the customers who are likely to accept personal loans

...hence the model with the 'balanced' class_weight is the right fit for this usecase with the True Positive of 124 and Precision of 0.9688



Find out coefficients of all the attributes and show the output in a data frame with column names?
For test data show all the rows where the predicted class is not equal to the observed class. (10 marks)
In [70]:
# Attributes and Co-efficients
dfcoef = pd.DataFrame({'Attributes':list(X3_test.columns),'Coefficient':list(logit.coef_[0])})
dfcoef.sort_values('Coefficient', ascending=False)
Out[70]:
Attributes Coefficient
41 Income_Log 6.9085
3 HaveCDAcct 4.2256
26 CCAvg_3-4K 1.7971
22 Family_4 1.4717
32 Education_3 1.4604
31 Education_2 0.9885
27 CCAvg_4-5K 0.9549
39 Mortgage_300-400K 0.7494
21 Family_3 0.5690
7 Zipcode_905-910 0.5249
16 Zipcode_950-955 0.4498
28 CCAvg_5-8K 0.3729
14 Zipcode_940-945 0.3186
11 Zipcode_925-930 0.3044
12 Zipcode_930-935 0.3037
38 Mortgage_250-300K 0.1962
0 Age 0.1514
10 Zipcode_920-925 0.1132
6 Zipcode_900-905 0.0169
15 Zipcode_945-950 0.0009
36 Mortgage_150-200K -0.0559
25 CCAvg_2-3K -0.0575
9 Zipcode_915-920 -0.0676
40 Mortgage_400K+ -0.1128
35 Mortgage_100-150K -0.1326
37 Mortgage_200-250K -0.1395
17 Zipcode_955-960 -0.1509
34 Mortgage_0-100K -0.1625
1 Experience -0.1629
13 Zipcode_935-940 -0.3366
33 Mortgage_Zero -0.3423
23 CCAvg_0-1K -0.3484
4 HaveOnline -0.4809
18 Zipcode_960+ -0.5299
20 Family_2 -0.8093
29 CCAvg_8K+ -0.9023
8 Zipcode_910-915 -0.9474
24 CCAvg_1-2K -0.9625
2 HaveSecAcct -1.0457
5 HaveCC -1.1588
19 Family_1 -1.2314
30 Education_1 -2.4489
In [71]:
# Showing predicted values different to the Actual values
y3_predict = logit.predict(X3_test)
dfpred = pd.DataFrame({'Actual':(list(y3_test.AccptPersLoan)),'Predicted':list(y3_predict)})
dfpred[dfpred.Actual != dfpred.Predicted]
Out[71]:
Actual Predicted
26 0 1
49 0 1
60 0 1
63 0 1
130 0 1
163 0 1
165 0 1
188 0 1
221 0 1
239 0 1
261 0 1
263 0 1
275 0 1
318 0 1
351 0 1
363 0 1
376 0 1
377 0 1
385 0 1
390 0 1
424 0 1
425 0 1
446 0 1
447 1 0
449 0 1
490 0 1
493 0 1
505 0 1
522 0 1
557 0 1
... ... ...
1041 0 1
1098 0 1
1104 1 0
1120 0 1
1141 0 1
1148 0 1
1152 0 1
1153 0 1
1166 0 1
1172 0 1
1184 0 1
1187 1 0
1205 0 1
1227 0 1
1233 0 1
1234 0 1
1294 0 1
1299 0 1
1313 0 1
1333 0 1
1347 0 1
1349 0 1
1351 0 1
1387 0 1
1395 0 1
1409 0 1
1417 0 1
1473 0 1
1474 0 1
1476 0 1

89 rows × 2 columns

In [72]:
# Summary showing predicted values different to the Actual values
dfpred[dfpred.Actual != dfpred.Predicted].groupby(['Actual','Predicted'])['Predicted'].agg('count')
Out[72]:
Actual  Predicted
0       1            85
1       0             4
Name: Predicted, dtype: int64
In [73]:
dfcoef.sort_values('Coefficient', ascending=False).head()
Out[73]:
Attributes Coefficient
41 Income_Log 6.9085
3 HaveCDAcct 4.2256
26 CCAvg_3-4K 1.7971
22 Family_4 1.4717
32 Education_3 1.4604
In [74]:
dfcoef.sort_values('Coefficient').head()
Out[74]:
Attributes Coefficient
30 Education_1 -2.4489
19 Family_1 -1.2314
5 HaveCC -1.1588
2 HaveSecAcct -1.0457
24 CCAvg_1-2K -0.9625
In [75]:
plt.figure(figsize=[5,3])
sns.set(font_scale=1.2)
fig = sns.heatmap(dfcm3, annot=True, fmt='d',linewidth=0.5, cbar=False)
plt.tick_params(axis='both', which='major', labelbottom = False, bottom=False, top = False, labeltop=True)
plt.ylabel('Predicted\n')
plt.show()

Based on the above data-points:

Ranking of factors which contribute to customers more likely to accept Personal Loans

  1. Higher Income - possibly the ability to repay loans quickly
  2. Having a 'Certificate of Deposit' Account (CD Account) - maybe having a safetynet in that deposit to repay for the personal loan
  3. Average CC balance between \$3,000 to \\$4000
  4. Having a large family - More needs than a smaller family
  5. Higher the education (Professional/Graduate) - maybe better understanding of how personal loan works & also the need for additional funds to complete their education!

Ranking of factors which contribute to customers NOT likely to accept Personal Loans

  1. Lower education seems to be the biggest factor (given the data) that negates customers taking loans
  2. Having smaller family - maybe there aren't unplanned expenses hitting them and hence not likely to take loans
  3. Having a Securities Account
  4. Having a CC with Average balance between \$1,000 to \\$2000

Based on the confusion matrix,

  1. The model is built for maximizing the 'True Positives', aka maximizing the return on the effort/investments in campaigns to offer personal loans to the right segment of customers
  2. There is also the lost opportunity with some customers who are likely to accept loans, who has been classified by this model as not likely to accept loans.
  3. Depending on the Bank's needs & priority, the model can be adjusted, thresholds adjusted to tilt the model to maximize, 'True Positives' or to minimize 'False Positives' and 'False Negatives'